Introduction

The hands-on exercise for this week focuses on: 1) estimating a topic model; and 2) interpreting and visualizing the results. Remember that you will need to: 1) comment your code; and 2) write out the interpretation of your results.

You will learn how to:

Setup

Before proceeding, we’ll load the packages we will need for this tutorial.

library(tidyverse) # loads dplyr, ggplot2, and others
library(stringr) # to handle text elements
library(tidytext) # includes set of functions useful for manipulating text
## Warning: package 'tidytext' was built under R version 4.3.2
library(topicmodels) # to estimate topic models
## Warning: package 'topicmodels' was built under R version 4.3.3
library(gutenbergr) # to get text data
## Warning: package 'gutenbergr' was built under R version 4.3.3
library(scales)
library(tm)
## Warning: package 'tm' was built under R version 4.3.3
library(ggthemes) # to make your plots look nice
## Warning: package 'ggthemes' was built under R version 4.3.2
library(readr)
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.3.2
library(quanteda.textmodels)
## Warning: package 'quanteda.textmodels' was built under R version 4.3.3

You may need to install the preText package if you haven’t done so yet. For that you will need to run the next code chunk (it is currently set to ‘eval=F’, which tells R ‘do not execute this code chunk’). That package is not readily available through CRAN directly; it needs to be downloaded from the GitHub repository set up by its creator, Matthew J. Denny. We can do that using the command install_github(). This command is part of the ‘devtools’ package, which you will need to install as well (if you haven’t done so already). The devtools package is available directly through CRAN, so it can be installed with the usual install.packages() command.

#install.packages("devtools")
devtools::install_github("matthewjdenny/preText")
library(preText)

Data collection

We’ll be using data from Alexis de Tocqueville’s “Democracy in America.”

We have already downloaded the data for you, but we also include the code to download it yourself (it is currently set to ‘eval=F’, so it won’t run unless you remove the eval=F argument or run the chunk directly).

The code downloads these data, both Volume 1 and Volume 2, and combines them into one data frame. For this, we use the gutenbergr package, which allows the user to download text data from over 60,000 out-of-copyright books. The ID for each book appears in the URL for that book after a search on https://www.gutenberg.org/ebooks/.

This example is adapted from Text Mining with R: A Tidy Approach by Julia Silge and David Robinson.

Here, we see that Volume 1 of Tocqueville’s “Democracy in America” is stored under ID “815”. A separate search reveals that Volume 2 is stored under ID “816”.

#USING THIS DATA
tocq <- gutenberg_download(c(815, 816), 
                            meta_fields = "author")
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

Or we can read the dataset we already downloaded for you in the following way:

tocq  <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/topicmodels/tocq.RDS?raw=true")))

Once we have read in these data, we convert them into a different data shape: the document-term matrix. We also create a new column, which we call “booknumber”, that records whether the term in question is from Volume 1 or Volume 2. To convert from tidy format into a “DocumentTermMatrix” object, we first use unnest_tokens() as we have done in past exercises, remove stop words, and then use the cast_dtm() function.

#convert to a different data shape: the DTM
tocq_words <- tocq %>%
  mutate(booknumber = ifelse(gutenberg_id==815, "DiA1", "DiA2")) %>% #new column recording whether a term is from Volume 1 or 2
  unnest_tokens(word, text) %>% #tokenize: lowercase words and remove punctuation
  filter(!is.na(word)) %>% #drop empty tokens
  count(booknumber, word, sort = TRUE) %>% #count each word per volume
  ungroup() %>%
  anti_join(stop_words) #remove stop words
## Joining with `by = join_by(word)`
tocq_dtm <- tocq_words %>%
  cast_dtm(booknumber, word, n) #convert to dtm

tm::inspect(tocq_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 11989)>>
## Non-/sparse entries: 17420/6558
## Sparsity           : 27%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   country democratic government laws nations people power society time
##   DiA1     353        213        531  395     231    499   540     291  309
##   DiA2     167        561        162  133     313    360   263     241  309
##       Terms
## Docs   united
##   DiA1    556
##   DiA2    227

We see here that the data are now stored as a “DocumentTermMatrix.” In this format, the matrix records each term (as the equivalent of a column), each document (as the equivalent of a row), and the number of times the term appears in the given document. Many terms will not appear in a given document, so the matrix is stored as “sparse”: most of its cells are zeroes. Here, since we are looking at only two documents, both from a single two-volume work, the sparsity is relatively low (only 27%). In most applications, the sparsity will be a lot higher, approaching 99% or more.
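The sparsity figure reported by inspect() is just the share of zero cells in the matrix. As a quick base-R check, we can reproduce the 27% from the numbers printed in the output above:

```r
# 2 documents x 11989 terms; 17420 non-sparse (non-zero) entries
n_cells <- 2 * 11989
n_zero  <- n_cells - 17420
round(100 * n_zero / n_cells) # sparsity as a percentage: 27
```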

Estimating our topic model is then relatively simple. All we need to do is specify how many topics we want to search for. We also set a seed, which is needed to reproduce the same results each time (the model is a generative probabilistic one, so different random initializations will produce different results).

#set seed and specify how many topics to search for (10)
tocq_lda <- LDA(tocq_dtm, k = 10, control = list(seed = 1234))

After this we can extract the per-topic-per-word probabilities, called “β” from the model:

#extract the beta
tocq_topics <- tidy(tocq_lda, matrix = "beta")

head(tocq_topics, n = 10)#show top 10
## # A tibble: 10 × 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 democratic 0.00735
##  2     2 democratic 0.00494
##  3     3 democratic 0.0169 
##  4     4 democratic 0.00200
##  5     5 democratic 0.00434
##  6     6 democratic 0.00509
##  7     7 democratic 0.00140
##  8     8 democratic 0.00840
##  9     9 democratic 0.0138 
## 10    10 democratic 0.00968

We now have the data stored as one topic-per-term-per-row. The betas listed here represent the probability that the given term belongs to a given topic. So, here, we see that the term “democratic” is most likely to belong to topic 3 (beta = 0.0169). Strictly, this probability represents the probability that the term is generated from the topic in question.
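To see which topic a term is most associated with, we simply take the topic with the largest beta. A base-R sketch using the betas for “democratic” printed above:

```r
# betas for "democratic" across the 10 topics (from the output above)
betas <- c(0.00735, 0.00494, 0.0169, 0.00200, 0.00434,
           0.00509, 0.00140, 0.00840, 0.0138, 0.00968)
which.max(betas) # topic 3 has the highest beta
```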

We can then plot the top terms, by beta, for each topic as follows:

#plot beta for each topic (top terms)
tocq_top_terms <- tocq_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

tocq_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 4) +
  scale_y_reordered() +
  theme_tufte(base_family = "Helvetica")

But how do we actually evaluate these topics? Here, the topics all seem pretty similar.

Evaluating topic model

Well, one way to evaluate the performance of unsupervised forms of classification is to test our model against an outcome that is already known.

Here, the two most obvious ‘topics’ are Volume 1 and Volume 2 of Tocqueville’s “Democracy in America.” Volume 1 deals more obviously with abstract constitutional ideas and questions of race; Volume 2 focuses on more esoteric aspects of American society. You can listen to an “In Our Time” episode with Melvyn Bragg discussing Democracy in America here.

Given these differences in focus, we might expect that a generative model could assign chapters to their topic (i.e., Volume) with some accuracy.

Plot relative word frequencies

First let’s have a look and see whether there really are words obviously distinguishing the two Volumes.

tidy_tocq <- tocq %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
## Count most common words in both
tidy_tocq %>%
  count(word, sort = TRUE)
## # A tibble: 11,989 × 2
##    word           n
##    <chr>      <int>
##  1 people       859
##  2 power        803
##  3 united       783
##  4 democratic   774
##  5 government   693
##  6 time         618
##  7 nations      544
##  8 society      532
##  9 laws         528
## 10 country      520
## # ℹ 11,979 more rows
bookfreq <- tidy_tocq %>%
  mutate(booknumber = ifelse(gutenberg_id==815, "DiA1", "DiA2")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(booknumber, word) %>%
  group_by(booknumber) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(booknumber, proportion)

ggplot(bookfreq, aes(x = DiA1, y = DiA2, color = abs(DiA1 - DiA2))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  theme_tufte(base_family = "Helvetica") +
  theme(legend.position="none", 
        strip.background = element_blank(), 
        strip.text.x = element_blank()) +
  labs(x = "Tocqueville DiA 1", y = "Tocqueville DiA 2") + #labels match the aes mapping above
  coord_equal()
## Warning: Removed 6100 rows containing missing values (`geom_point()`).
## Warning: Removed 6101 rows containing missing values (`geom_text()`).

We see that there do seem to be some marked distinguishing characteristics. The way to read the plot is that words positioned further away from the dashed diagonal appear with greater frequency in one volume than in the other. For example, consistent with the term counts above, more abstract notions of state systems (e.g., “government,” “laws”) appear with greater frequency in Volume 1, while Volume 2 uses terms such as “democratic” with greater frequency.

Split into chapter documents

In the below, we first separate the volumes into chapters, then repeat the same procedure as above. The only difference now is that instead of two documents representing the two full volumes of Tocqueville’s work, we have 79 documents, each representing an individual chapter. Notice that the sparsity is now much higher: around 95%.

tocq <- tocq %>%
  filter(!is.na(text))

# Divide into documents, each representing one chapter
tocq_chapter <- tocq %>%
  mutate(booknumber = ifelse(gutenberg_id==815, "DiA1", "DiA2")) %>%
  group_by(booknumber) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, booknumber, chapter)

# Split into words
tocq_chapter_word <- tocq_chapter %>%
  unnest_tokens(word, text)

# Find document-word counts
tocq_word_counts <- tocq_chapter_word %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
tocq_word_counts
## # A tibble: 44,633 × 3
##    document word           n
##    <chr>    <chr>      <int>
##  1 DiA1_1   people       204
##  2 DiA1_1   government   198
##  3 DiA1_3   union        198
##  4 DiA1_1   power        184
##  5 DiA1_1   union        157
##  6 DiA1_1   public       155
##  7 DiA1_1   united       154
##  8 DiA1_1   federal      139
##  9 DiA1_3   united       139
## 10 DiA1_3   south        134
## # ℹ 44,623 more rows
# Cast into DTM format for LDA analysis

tocq_chapters_dtm <- tocq_word_counts %>%
  cast_dtm(document, word, n)

tm::inspect(tocq_chapters_dtm)
## <<DocumentTermMatrix (documents: 79, terms: 11014)>>
## Non-/sparse entries: 44633/825473
## Sparsity           : 95%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##          Terms
## Docs      americans country democratic government nations people power society
##   DiA1_1         58      99         85        198      91    204   184      86
##   DiA1_2         86      88         92         63      46    106    73      85
##   DiA1_3        106      76         10        110      41     60    62      38
##   DiA2_46         3       4          4          1       4      3     2       6
##   DiA2_5          1       2         11          0       5      7     2       4
##   DiA2_60         9      10         10          0      12     16    11      15
##   DiA2_63         1       4         31          1      13     20     6      16
##   DiA2_73         2       4          9         33      19     12    34       3
##   DiA2_75         1       1         22         20      12      6    14      11
##   DiA2_76         5      11         10         24      12     31    27      16
##          Terms
## Docs      time united
##   DiA1_1    83    154
##   DiA1_2    52    123
##   DiA1_3    90    139
##   DiA2_46    9      2
##   DiA2_5     6      4
##   DiA2_60    8      6
##   DiA2_63   11      7
##   DiA2_73   16      0
##   DiA2_75   12      0
##   DiA2_76   50     88

We then re-estimate the topic model with this new DocumentTermMatrix object, specifying k equal to 2. This will enable us to evaluate whether the topic model is able to generatively assign chapters to the correct volume.

#evaluate whether a topic model can generatively assign to volume with accuracy
tocq_chapters_lda <- LDA(tocq_chapters_dtm, k = 2, control = list(seed = 1234))

After this, it is worth looking at another output of the latent Dirichlet allocation procedure. The γ probability represents the per-document-per-topic probability or, in other words, the probability that a given document (here: chapter) belongs to a particular topic (and here, we are assuming these topics represent volumes).

The gamma values are therefore the estimated proportion of words within a given chapter allocated to a given volume.

#gamma is the proportion of words within a given chapter allocated to a given volume
tocq_chapters_gamma <- tidy(tocq_chapters_lda, matrix = "gamma")
tocq_chapters_gamma
## # A tibble: 158 × 3
##    document topic     gamma
##    <chr>    <int>     <dbl>
##  1 DiA1_1       1 1.00     
##  2 DiA1_3       1 0.667    
##  3 DiA1_2       1 0.999    
##  4 DiA2_76      1 1.00     
##  5 DiA2_60      1 0.385    
##  6 DiA2_16      1 0.0000771
##  7 DiA2_22      1 0.000139 
##  8 DiA2_64      1 0.0000727
##  9 DiA2_73      1 0.0853   
## 10 DiA2_28      1 0.562    
## # ℹ 148 more rows
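Because gamma is, for each document, a probability distribution over topics, the gammas for any one document sum to 1. With k = 2 this means the second topic’s gamma is just one minus the first; a toy illustration using the topic-1 value for DiA2_60 printed above (the topic-2 value is derived, not taken from the output):

```r
# DiA2_60 has gamma = 0.385 for topic 1, so topic 2 gets the remainder
gamma_doc <- c(topic1 = 0.385, topic2 = 1 - 0.385)
sum(gamma_doc) # sums to 1
```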

Examine consensus

Now that we have these topic probabilities, we can see how well our unsupervised learning did at distinguishing the two volumes generatively just from the words contained in each chapter.

# First separate the document name into title and chapter
tocq_chapters_gamma <- tocq_chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)

#classify each chapter by its highest-gamma topic
tocq_chapter_classifications <- tocq_chapters_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup()

#find the consensus topic for each title (volume of the book)
tocq_book_topics <- tocq_chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

#join with the consensus topics to find misclassified chapters
tocq_chapter_classifications %>%
  inner_join(tocq_book_topics, by = "topic") %>%
  filter(title != consensus)
## # A tibble: 8 × 5
##   title chapter topic gamma consensus
##   <chr>   <int> <int> <dbl> <chr>    
## 1 DiA2       76     1 1.00  DiA1     
## 2 DiA2       28     1 0.562 DiA1     
## 3 DiA2       27     1 0.703 DiA1     
## 4 DiA2       54     1 0.740 DiA1     
## 5 DiA2       53     1 0.516 DiA1     
## 6 DiA2       51     1 0.616 DiA1     
## 7 DiA2       52     1 0.659 DiA1     
## 8 DiA2       44     1 0.520 DiA1
# Look at document-word pairs to see which words in each document were assigned to a given topic
assignments <- augment(tocq_chapters_lda, data = tocq_chapters_dtm)
assignments
## # A tibble: 44,633 × 4
##    document term   count .topic
##    <chr>    <chr>  <dbl>  <dbl>
##  1 DiA1_1   people   204      1
##  2 DiA1_3   people    60      1
##  3 DiA1_2   people   106      1
##  4 DiA2_76  people    31      1
##  5 DiA2_60  people    16      2
##  6 DiA2_16  people     7      2
##  7 DiA2_22  people     2      2
##  8 DiA2_64  people     8      2
##  9 DiA2_73  people    12      2
## 10 DiA2_28  people     4      1
## # ℹ 44,623 more rows
#separate assignments by title and chapter
assignments <- assignments %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(tocq_book_topics, by = c(".topic" = "topic"))


assignments %>%
  count(title, consensus, wt = count) %>%
  group_by(title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) + #shade higher percentages red
  geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +
  theme_tufte(base_family = "Helvetica") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Book words assigned to",
       y = "Book words came from",
       fill = "% of assignments")

Not bad! We see that the model assigned the large majority of chapters to the correct volume: roughly 94% of Volume 1 chapters and 77% of Volume 2 chapters.

Validation

In the articles by @ying_topics_2021 and @denny_text_2018 from this and previous weeks, we read about potential validation techniques.

In this section, we’ll be using the preText package mentioned in @denny_text_2018 to see the impact of different pre-processing choices on our text. Here, I adapt a tutorial by Matthew Denny.

First we need to reformat our text into a quanteda corpus object.

# load in corpus of Tocqueville text data
corp <- corpus(tocq, text_field = "text")
# take a random sample of 1,000 lines (documents) for this example
documents <- corp[sample(1:30000, 1000)]
# take a look at the names of the first ten documents
print(names(documents[1:10]))
##  [1] "text27303" "text23936" "text1470"  "text16582" "text24195" "text28032"
##  [7] "text458"   "text6386"  "text21648" "text17997"

And now we are ready to preprocess the text in different ways. Here, since we are including n-grams, we preprocess the text in 128 different ways. This takes about ten minutes to run on a machine with 8GB of RAM.

#factorial pre processing incuding n-grams
preprocessed_documents <- factorial_preprocessing(
    documents,
    use_ngrams = TRUE,
    infrequent_term_threshold = 0.2,
    verbose = FALSE)
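The 128 figure comes from the factorial design: with n-grams included there are seven binary preprocessing decisions (in the Denny and Spirling framework: punctuation removal, number removal, lowercasing, stemming, stop word removal, n-gram inclusion, and infrequent term removal), and every on/off combination is tried:

```r
# seven binary preprocessing choices -> 2^7 combinations
n_choices <- 7
2^n_choices # 128
```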

We can then get the results of our pre-processing, comparing the distance between documents that have been processed in different ways.

#get the preText results comparing the different preprocessing choices
preText_results <- preText(
    preprocessed_documents,
    dataset_name = "Tocqueville text",
    distance_method = "cosine",
    num_comparisons = 20,
    verbose = FALSE)

And we can plot these accordingly.

preText_score_plot(preText_results)

————–Exercises——————-

  1. Choose another book or set of books from Project Gutenberg. We chose:
     1. How to be Happy Though Married: Being a Handbook to Marriage by E. J. Hardy (35534), published 1887, 260 pages
     2. Love and Marriage by Ellen Key (57592), published 1911, 350 pages
     We chose these books because we wanted to examine a male author’s perspective on marriage advice as compared to a female author’s perspective. These books were published at roughly the same time and cover similar themes.
  2. Run your own topic model on these books, changing the k of topics, and evaluating accuracy.
#load in the data
keyhar<- gutenberg_download(c(57592, 35534), 
                            meta_fields = "author")
#convert data to DTM, with a new column recording the author (Key or Hardy)
keyhar_words <- keyhar %>%
  mutate(author = ifelse(gutenberg_id==57592, "Key", "Hardy")) %>%
  unnest_tokens(word, text) %>%
  filter(!is.na(word)) %>%
  count(author, word, sort = TRUE) %>%
  ungroup() %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
keyhar_dtm <- keyhar_words %>%
  cast_dtm(author, word, n)

tm::inspect(keyhar_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 13031)>>
## Non-/sparse entries: 16604/9458
## Sparsity           : 36%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    children husband life love marriage married time wife woman women
##   Hardy      188     189  270  179      171     171   90  335   107    81
##   Key        325      70  523  731      300      73  186   84   369   288

2a) In the DTM we can see the difference in frequency for each word used by Hardy and Key, with distinct differences in terms like “love”, which Key uses roughly four times as often as Hardy (731 vs. 179). The sparsity of 36% reflects the fact that we are looking at just two documents; this is quite low compared to DTMs built from many documents, which suggests our DTM is relatively dense.
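A quick check of that “love” ratio from the counts in the DTM sample above:

```r
# "love" counts from the DTM: Key 731, Hardy 179
round(731 / 179, 1) # Key uses "love" about 4 times as often as Hardy
```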

#estimating topic model
keyhar_lda <- LDA(keyhar_dtm, k = 6, control = list(seed = 1234))

#extract the beta (per topic per word prob)
keyhar_topics <- tidy(keyhar_lda, matrix = "beta")

head(keyhar_topics, n = 6)
## # A tibble: 6 × 3
##   topic term     beta
##   <int> <chr>   <dbl>
## 1     1 love  0.00689
## 2     2 love  0.00143
## 3     3 love  0.0238 
## 4     4 love  0.0185 
## 5     5 love  0.0126 
## 6     6 love  0.0103

2b) The betas listed here represent the probability that the given term belongs to a given topic. So, here, we see that the term “love” is most likely to belong to topic 3 (b=.0238).

#plot the prob 
keyhar_top_terms <- keyhar_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

keyhar_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 3) + #three panels per row
  scale_y_reordered() +
  theme_tufte(base_family = "Helvetica")

2c) We divided the topics into 6 groups (k = 6) because our data are smaller than the DiA data, for which we used k = 10. This is a visualization of the top terms in each topic, ranked by their respective betas. We considered summarizing each topic by the words listed under it, but we found it difficult to distinguish some topics because their terms are too similar (perhaps because we picked books on the same subject, so it makes sense that they share many terms).

#relative word frequencies

tidy_keyhar <- keyhar %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
# Count most common words in both
tidy_keyhar %>%
  count(word, sort = TRUE)
## # A tibble: 13,031 × 2
##    word         n
##    <chr>    <int>
##  1 love       910
##  2 life       793
##  3 children   513
##  4 woman      476
##  5 marriage   471
##  6 wife       419
##  7 women      369
##  8 time       276
##  9 husband    259
## 10 married    244
## # ℹ 13,021 more rows
bookfreq <- tidy_keyhar %>%
  mutate(author = ifelse(gutenberg_id==57592, "Key", "Hardy")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(author, proportion)

ggplot(bookfreq, aes(x = Key, y = Hardy, color = abs(Key - Hardy))) + #color by absolute difference in frequency between Key and Hardy
  geom_abline(color = "black", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "red") +
  theme_tufte(base_family = "Helvetica") +
  theme(legend.position="none", 
        strip.background = element_blank(), 
        strip.text.x = element_blank()) +
  labs(x = "Key", y = "Hardy") +
  coord_equal()
## Warning: Removed 8776 rows containing missing values (`geom_point()`).
## Warning: Removed 8777 rows containing missing values (`geom_text()`).

2d) This frequency plot compares the words most used in Key’s writing versus Hardy’s. A term like “blood”, which is bright red and on Key’s side of the graph, is used much more frequently by Key than by Hardy. Note that red indicates a large absolute difference in usage between the two authors, dark green a small difference, and grey a middling difference (one author uses the word only somewhat more often than the other).

#split into chapters
keyhar <- keyhar %>%
  filter(!is.na(text))

# Divide into documents, each representing one chapter
keyhar_chapter <- keyhar %>%
  mutate(author = ifelse(gutenberg_id==57592, "Key", "Hardy")) %>%
  group_by(author) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, author, chapter)

# Split into words
keyhar_chapter_word <- keyhar_chapter %>%
  unnest_tokens(word, text)

# Find document-word count
keyhar_word_counts <- keyhar_chapter_word %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
keyhar_word_counts
## # A tibble: 36,655 × 3
##    document word         n
##    <chr>    <chr>    <int>
##  1 Key_2    love       175
##  2 Key_8    love       145
##  3 Key_1    love       135
##  4 Hardy_29 8vo        107
##  5 Key_1    life        96
##  6 Key_8    life        96
##  7 Key_2    woman       94
##  8 Hardy_29 cloth       89
##  9 Key_6    children    85
## 10 Hardy_29 6d          83
## # ℹ 36,645 more rows
# Cast into DTM format for LDA analysis

keyhar_chapters_dtm <- keyhar_word_counts %>%
  cast_dtm(document, word, n)

tm::inspect(keyhar_chapters_dtm)
## <<DocumentTermMatrix (documents: 38, terms: 12834)>>
## Non-/sparse entries: 36655/451037
## Sparsity           : 92%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##           Terms
## Docs       children husband life love marriage people time wife woman women
##   Hardy_29       16      14   65   20       23      5   13   56     4    10
##   Key_1          16       3   96  135       71     13   19    7    11    14
##   Key_2           2       4   53  175       30      5   21    6    94    56
##   Key_3          35       1   49   78       33     21   21    5    18     7
##   Key_4          16       5   32   81       23     10   12    5    21     7
##   Key_5          10       2   58   59       22      5   11    3    49    49
##   Key_6          85       8   52   17        3      7   26    4    40    39
##   Key_7          21       2   47   11        2      1   18    4    73    72
##   Key_8          77       7   96  145       63     48   25   10    22    10
##   Key_9          62      38   33   22       50     17   25   39    31    23

2e) When splitting into chapter-level documents, our sparsity jumps from 36% to 92%, which confirms that the vast majority of cells in this DTM are zero: each individual chapter uses only a small fraction of the combined vocabulary. (Open question: why does the Donovan/Hardy version of this analysis below return different results?)
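As a sanity check, the sparsity figure reported by `tm::inspect()` is just the share of zero cells in the matrix. A minimal base-R sketch on a toy count matrix (not the real DTM):

```r
# Toy document-term count matrix: 3 documents x 4 terms
m <- matrix(c(5, 0, 0, 1,
              0, 2, 0, 0,
              0, 0, 3, 0),
            nrow = 3, byrow = TRUE)

# Sparsity = proportion of zero entries (tm reports this as a whole percent)
sparsity <- sum(m == 0) / length(m)
round(100 * sparsity)  # 67: 8 zeros out of 12 cells
```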

#re-estimate topic model with the new DTM, k = 2
keyhar_chapters_lda <- LDA(keyhar_chapters_dtm, k = 2, control = list(seed = 1234))

#gamma val estimates
keyhar_chapters_gamma <- tidy(keyhar_chapters_lda, matrix = "gamma")
keyhar_chapters_gamma
## # A tibble: 76 × 3
##    document topic      gamma
##    <chr>    <int>      <dbl>
##  1 Key_2        1 0.996     
##  2 Key_8        1 1.00      
##  3 Key_1        1 1.00      
##  4 Hardy_29     1 0.00000653
##  5 Key_6        1 1.00      
##  6 Key_4        1 1.00      
##  7 Key_3        1 1.00      
##  8 Key_7        1 1.00      
##  9 Key_9        1 1.00      
## 10 Key_5        1 1.00      
## # ℹ 66 more rows

2f) The gamma values returned here estimate the proportion of words within a given chapter allocated to each topic. Within topic 1, Key_8 (which we take to be chapter 8 of Key’s book) has a gamma that rounds to 1, meaning essentially all of its words are allocated to the topic the model associates with Key’s writing. Key_8 is therefore among the chapters most characteristic of Key’s writing as the model sees it.
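Because each chapter’s words are fully apportioned across the k topics, the gamma values sum to 1 within a document, so a gamma near 1 for topic 1 forces a gamma near 0 for topic 2. A toy sketch with illustrative values (not the fitted ones):

```r
# Toy gamma table for a k = 2 model (made-up values)
gamma_toy <- data.frame(
  document = c("Key_8", "Key_8", "Hardy_29", "Hardy_29"),
  topic    = c(1L, 2L, 1L, 2L),
  gamma    = c(0.99653, 0.00347, 0.00000653, 0.99999347)
)

# gamma values sum to 1 within each document
doc_sums <- tapply(gamma_toy$gamma, gamma_toy$document, sum)
all(abs(doc_sums - 1) < 1e-8)  # TRUE
```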

#can the unsupervised model distinguish the two books?
#first separate the document name into title and chapter
keyhar_chapters_gamma <- keyhar_chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)

keyhar_chapter_classifications <- keyhar_chapters_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup()

keyhar_book_topics <- keyhar_chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

keyhar_chapter_classifications %>%
  inner_join(keyhar_book_topics, by = "topic") %>%
  filter(title != consensus)
## # A tibble: 1 × 5
##   title chapter topic gamma consensus
##   <chr>   <int> <int> <dbl> <chr>    
## 1 Hardy      15     1 0.599 Key
# Look at document-word pairs to see which words in each document were
# assigned to a given topic

assignments_kh <- augment(keyhar_chapters_lda, data = keyhar_chapters_dtm)
assignments_kh
## # A tibble: 36,655 × 4
##    document term  count .topic
##    <chr>    <chr> <dbl>  <dbl>
##  1 Key_2    love    175      1
##  2 Key_8    love    145      1
##  3 Key_1    love    135      1
##  4 Hardy_29 love     20      2
##  5 Key_6    love     17      1
##  6 Key_4    love     81      1
##  7 Key_3    love     78      1
##  8 Key_7    love     11      1
##  9 Key_9    love     22      1
## 10 Key_5    love     59      1
## # ℹ 36,645 more rows
assignments_kh <- assignments_kh %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(keyhar_book_topics, by = c(".topic" = "topic"))

assignments_kh %>%
  count(title, consensus, wt = count) %>%
  group_by(title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) +
  geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +
  theme_tufte(base_family = "Helvetica") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Book words assigned to",
       y = "Book words came from",
       fill = "% of assignments")

2g) After running the unsupervised learning model, we see that it correctly assigned 92.5% of the words in Hardy’s writing and 100% of the words in Key’s. This is promising: it suggests the model would perform well at attributing unseen text from Hardy to him. However, the 100% success rate on Key’s writing may indicate that the model is overfit and has merely memorized the data; if so, it would not reliably categorize an unseen sample of Key’s writing as her own.
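The percentages in the confusion plot come from comparing, for each word token, the book it came from (`title`) with the book its assigned topic implies (`consensus`). A base-R sketch with made-up counts (the real tallies come from `assignments_kh`):

```r
# Toy assignments summary (illustrative counts only)
assign_toy <- data.frame(
  title     = c("Hardy", "Hardy", "Key", "Key"),
  consensus = c("Hardy", "Key",   "Key", "Hardy"),
  count     = c(925, 75, 1000, 0)
)

# Percent of each book's word tokens assigned to that book's own topic
correct <- assign_toy$title == assign_toy$consensus
acc <- tapply(assign_toy$count * correct, assign_toy$title, sum) /
       tapply(assign_toy$count, assign_toy$title, sum)
round(100 * acc, 1)  # Hardy 92.5, Key 100
```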

  1. Validate different pre-processing techniques using preText on the new book(s) of your choice.
#reformat text into quanteda corpus 
# load in corpus of keyhar text data.
corp <- corpus(keyhar, text_field = "text")
# sample 100 documents for the example
documents <- corp[sample(1:3000,100)]
# take a look at the document names
print(names(documents[1:10]))
##  [1] "text2203" "text1761" "text1805" "text1986" "text615"  "text2321"
##  [7] "text2445" "text2752" "text586"  "text872"
#n-gram preprocessing
preprocessed_documents <- factorial_preprocessing(
    documents,
    use_ngrams = TRUE,
    infrequent_term_threshold = 0.2, # remove terms appearing in fewer than 20% of documents
    verbose = FALSE)
#results of preprocessing
preText_results <- preText(
    preprocessed_documents,
    dataset_name = "Keyhar text",
    distance_method = "cosine",
    num_comparisons = 20,
    verbose = FALSE)
#plot the results 
preText_score_plot(preText_results)
  1. This plot of the preText score results compares the distances between documents that have been preprocessed in 128 different ways. We set the infrequent-term threshold at 0.2, which means terms appearing in fewer than 20% of documents were removed. Higher preText scores flag preprocessing specifications whose results are more unusual relative to the others; we think our scores are consistently positive because the texts share similar themes.

##----------------- Donovan and Hardy (first attempt, issue with accuracy) -----------------

#get the two books
Donhar <- gutenberg_download(c(53368,35534), 
                            meta_fields = "author")
Donhar
## # A tibble: 12,212 × 3
##    gutenberg_id text                                                      author
##           <int> <chr>                                                     <chr> 
##  1        35534 "[Transcriber's note: The author's spelling has been mai… Hardy…
##  2        35534 ""                                                        Hardy…
##  3        35534 "+ signs around words indicate the use of a different fo… Hardy…
##  4        35534 ""                                                        Hardy…
##  5        35534 "In the word \"Puranic\", the \"a\" is overlined in the … Hardy…
##  6        35534 ""                                                        Hardy…
##  7        35534 ""                                                        Hardy…
##  8        35534 ""                                                        Hardy…
##  9        35534 ""                                                        Hardy…
## 10        35534 "_HOW TO BE HAPPY THOUGH MARRIED._"                       Hardy…
## # ℹ 12,202 more rows
#prepare the document
Donhar_words <- Donhar %>%
  mutate(author = ifelse(gutenberg_id==53368, "Donovan", "Hardy")) %>%
  unnest_tokens(word, text) %>%
  filter(!is.na(word)) %>%
  count(author, word, sort = TRUE) %>%
  ungroup() %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
Donhar_words
## # A tibble: 12,741 × 3
##    author  word         n
##    <chr>   <chr>    <int>
##  1 Hardy   wife       335
##  2 Hardy   life       270
##  3 Hardy   husband    189
##  4 Hardy   children   188
##  5 Hardy   love       179
##  6 Hardy   marriage   171
##  7 Hardy   married    171
##  8 Hardy   home       136
##  9 Hardy   day        127
## 10 Donovan marry      121
## # ℹ 12,731 more rows
#turn it into a document term matrix
Donhar_dtm <- Donhar_words %>%
  cast_dtm(author, word, n)

tm::inspect(Donhar_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 10644)>>
## Non-/sparse entries: 12741/8547
## Sparsity           : 40%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##          Terms
## Docs      children home husband life love marriage married marry wife woman
##   Donovan       16   23      13   62   42       38      19   121   27    44
##   Hardy        188  136     189  270  179      171     171    40  335   107
#estimate the LDA model, specifying the number of topics (k)
Donhar_lda <- LDA(Donhar_dtm, k = 6, control = list(seed = 1234))

#extract the per-word-per-topic probabilities
Donhar_topics <- tidy(Donhar_lda, matrix = "beta")

head(Donhar_topics, n = 6)
## # A tibble: 6 × 3
##   topic term       beta
##   <int> <chr>     <dbl>
## 1     1 wife  0.00864  
## 2     2 wife  0.0000973
## 3     3 wife  0.0116   
## 4     4 wife  0.0160   
## 5     5 wife  0.000839 
## 6     6 wife  0.0136
##so the term "wife" is most likely to belong to topic 4
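That read-off can be computed directly. Using the beta values printed above for “wife”:

```r
# Per-topic probabilities for the term "wife", copied from the output above
beta_wife <- c(0.00864, 0.0000973, 0.0116, 0.0160, 0.000839, 0.0136)

# The topic in which this term has its highest probability
which.max(beta_wife)  # 4
```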

#plot the results
Donhar_top_terms <- Donhar_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

Donhar_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  scale_y_reordered() +
  theme_tufte(base_family = "Helvetica")

##have a look at whether there are specific words that can distinguish the two books
tidy_Donhar <- Donhar %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
## Count most common words in both
tidy_Donhar %>%
  count(word, sort = TRUE)
## # A tibble: 10,644 × 2
##    word         n
##    <chr>    <int>
##  1 wife       362
##  2 life       332
##  3 love       221
##  4 marriage   209
##  5 children   204
##  6 husband    202
##  7 married    190
##  8 marry      161
##  9 home       159
## 10 woman      151
## # ℹ 10,634 more rows
bookfreq3 <- tidy_Donhar %>%
  mutate(author = ifelse(gutenberg_id==53368, "Donovan", "Hardy")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(author, proportion)
head(bookfreq3)
## # A tibble: 6 × 3
##   word        Donovan     Hardy
##   <chr>         <dbl>     <dbl>
## 1 a         NA        0.000213 
## 2 aback     NA        0.0000305
## 3 abandon    0.000141 0.0000305
## 4 abandoned NA        0.0000305
## 5 abandons  NA        0.0000305
## 6 abated    NA        0.0000305
ggplot(bookfreq3, aes(x = Donovan, y = Hardy, color = abs(Donovan - Hardy))) +
  geom_abline(color = "orange", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "purple") +
  theme_tufte(base_family = "Helvetica") +
  theme(legend.position="none", 
        strip.background = element_blank(), 
        strip.text.x = element_blank()) +
  labs(x = "Donovan", y = "Hardy") +
  coord_equal()
## Warning: Removed 8046 rows containing missing values (`geom_point()`).
## Warning: Removed 8047 rows containing missing values (`geom_text()`).

##split into chapters
Donhar <- Donhar %>%
  filter(!is.na(text))
Donhar
## # A tibble: 12,212 × 3
##    gutenberg_id text                                                      author
##           <int> <chr>                                                     <chr> 
##  1        35534 "[Transcriber's note: The author's spelling has been mai… Hardy…
##  2        35534 ""                                                        Hardy…
##  3        35534 "+ signs around words indicate the use of a different fo… Hardy…
##  4        35534 ""                                                        Hardy…
##  5        35534 "In the word \"Puranic\", the \"a\" is overlined in the … Hardy…
##  6        35534 ""                                                        Hardy…
##  7        35534 ""                                                        Hardy…
##  8        35534 ""                                                        Hardy…
##  9        35534 ""                                                        Hardy…
## 10        35534 "_HOW TO BE HAPPY THOUGH MARRIED._"                       Hardy…
## # ℹ 12,202 more rows
#the output looks unchanged because the blank lines are empty strings (""), not NAs, so filter(!is.na(text)) does not remove them


# Divide into documents, each representing one chapter
Donhar_chapter <- Donhar %>%
  mutate(author = ifelse(gutenberg_id==53368, "Donovan", "Hardy")) %>%
  group_by(author) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, author, chapter)
Donhar_chapter
## # A tibble: 9,864 × 3
##    gutenberg_id text                                                    document
##           <int> <chr>                                                   <chr>   
##  1        35534 "CHAPTER I."                                            Hardy_1 
##  2        35534 ""                                                      Hardy_1 
##  3        35534 "HOW TO BE HAPPY _THOUGH_ MARRIED."                     Hardy_1 
##  4        35534 ""                                                      Hardy_1 
##  5        35534 "  \"How delicious is the winning"                      Hardy_1 
##  6        35534 "  Of a kiss at love's beginning,"                      Hardy_1 
##  7        35534 "  When two mutual hearts are sighing"                  Hardy_1 
##  8        35534 "  For the knot there's no untying!\"--_T. Campbell._"  Hardy_1 
##  9        35534 ""                                                      Hardy_1 
## 10        35534 "     \"Deceive not thyself by over-expecting happines… Hardy_1 
## # ℹ 9,854 more rows
# Split into words
Donhar_chapter_word <- Donhar_chapter %>%
  unnest_tokens(word, text)
Donhar_chapter_word
## # A tibble: 86,377 × 3
##    gutenberg_id document word     
##           <int> <chr>    <chr>    
##  1        35534 Hardy_1  chapter  
##  2        35534 Hardy_1  i        
##  3        35534 Hardy_1  how      
##  4        35534 Hardy_1  to       
##  5        35534 Hardy_1  be       
##  6        35534 Hardy_1  happy    
##  7        35534 Hardy_1  _though_ 
##  8        35534 Hardy_1  married  
##  9        35534 Hardy_1  how      
## 10        35534 Hardy_1  delicious
## # ℹ 86,367 more rows
# Find document-word counts
Donhar_word_counts <- Donhar_chapter_word %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
Donhar_word_counts
## # A tibble: 20,995 × 3
##    document word         n
##    <chr>    <chr>    <int>
##  1 Hardy_29 8vo        107
##  2 Hardy_29 cloth       89
##  3 Hardy_29 6d          83
##  4 Hardy_29 author      76
##  5 Hardy_29 book        65
##  6 Hardy_29 life        65
##  7 Hardy_29 crown       64
##  8 Hardy_29 edition     59
##  9 Hardy_29 wife        56
## 10 Hardy_19 children    42
## # ℹ 20,985 more rows
# Cast into DTM format for LDA analysis

Donhar_chapters_dtm <- Donhar_word_counts %>%
  cast_dtm(document, word, n)

tm::inspect(Donhar_chapters_dtm)
## <<DocumentTermMatrix (documents: 29, terms: 9390)>>
## Non-/sparse entries: 20995/251315
## Sparsity           : 92%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##           Terms
## Docs       children day home husband life love marriage married people wife
##   Hardy_10        3   8   10      22    4   14       12       4      2   31
##   Hardy_11        0  10    6      17   13   11        6      10      6   21
##   Hardy_19       42   4   10       6    6    8        0       0      3    6
##   Hardy_2         7   3    5       3   21    4       25      15      3   12
##   Hardy_24        1  12   13      22   18    2        2       7      3   21
##   Hardy_27        4   2    3       5    4    1        2       1      4    5
##   Hardy_29       16  18   12      14   65   20       23      16      5   56
##   Hardy_3         2   4    1       3   11    4       13       5      1   37
##   Hardy_4         1   3    2      11    6    5        6       3      1   29
##   Hardy_6         1   8    1       5    6    8       14      10     11    9
##no Donovan documents appear in the sample: the "^chapter" pattern above apparently never matches a line in the Donovan text, so filter(chapter > 0) drops all of its lines
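One way to confirm this is to count lines matching the chapter pattern in each book: if a text never starts a line with “chapter ”, the `cumsum()` above stays at 0 and `filter(chapter > 0)` drops the whole book. A base-R sketch with toy lines (not the real texts):

```r
# Toy text lines: one book uses "CHAPTER I." headings, the other does not
hardy_lines   <- c("CHAPTER I.", "some text", "CHAPTER II.", "more text")
donovan_lines <- c("I.", "some text", "II.", "more text")

# Same pattern as the chapter-splitting pipeline above
n_hardy   <- sum(grepl("^chapter ", hardy_lines, ignore.case = TRUE))
n_donovan <- sum(grepl("^chapter ", donovan_lines, ignore.case = TRUE))
c(hardy = n_hardy, donovan = n_donovan)  # hardy 2, donovan 0
```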

#re-estimate the topic model with the chapter-level DTM, k = 2
Donhar_chapters_lda <- LDA(Donhar_chapters_dtm, k = 2, control = list(seed = 1234))


#get the gamma value estimates
Donhar_chapters_gamma <- tidy(Donhar_chapters_lda, matrix = "gamma")

Donhar_chapters_gamma
## # A tibble: 58 × 3
##    document topic     gamma
##    <chr>    <int>     <dbl>
##  1 Hardy_29     1 1.00     
##  2 Hardy_19     1 1.00     
##  3 Hardy_14     1 0.0000192
##  4 Hardy_18     1 1.00     
##  5 Hardy_3      1 0.0000145
##  6 Hardy_10     1 0.0000134
##  7 Hardy_27     1 0.0000174
##  8 Hardy_4      1 0.769    
##  9 Hardy_16     1 0.0000215
## 10 Hardy_2      1 1.00     
## # ℹ 48 more rows
# First separate the document name into title and chapter

Donhar_chapters_gamma <- Donhar_chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)

Donhar_chapter_classifications <- Donhar_chapters_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup()

Donhar_book_topics <- Donhar_chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

Donhar_chapter_classifications %>%
  inner_join(Donhar_book_topics, by = "topic") %>%
  filter(title != consensus)
## # A tibble: 0 × 5
## # ℹ 5 variables: title <chr>, chapter <int>, topic <int>, gamma <dbl>,
## #   consensus <chr>
# Look at document-word pairs to see which words in each document were
# assigned to a given topic

assignments3 <- augment(Donhar_chapters_lda, data = Donhar_chapters_dtm)
assignments3
## # A tibble: 20,995 × 4
##    document term   count .topic
##    <chr>    <chr>  <dbl>  <dbl>
##  1 Hardy_29 8vo      107      1
##  2 Hardy_29 cloth     89      1
##  3 Hardy_29 6d        83      1
##  4 Hardy_29 author    76      1
##  5 Hardy_14 author     1      2
##  6 Hardy_3  author     1      2
##  7 Hardy_10 author     1      2
##  8 Hardy_4  author     1      1
##  9 Hardy_26 author     1      2
## 10 Hardy_29 book      65      1
## # ℹ 20,985 more rows
assignments3 <- assignments3 %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(Donhar_book_topics, by = c(".topic" = "topic"))

assignments3 %>%
  count(title, consensus, wt = count) %>%
  group_by(title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) +
  geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +
  theme_tufte(base_family = "Helvetica") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Book words assigned to",
       y = "Book words came from",
       fill = "% of assignments")

##------------ BRONTES (first issue with accuracy fixed, but this model performs poorly: overfitting) ------------

##wondering whether the problem is that the samples are too short; try the Brontës' works as samples instead

#get the two books
bronte <- gutenberg_download(c(768,1260), 
                            meta_fields = "author")

#prepare the document
bronte_words <- bronte %>%
  mutate(author = ifelse(gutenberg_id==768, "Emily", "Charlotte")) %>%
  unnest_tokens(word, text) %>%
  filter(!is.na(word)) %>%
  count(author, word, sort = TRUE) %>%
  ungroup() %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
#turn it into a document term matrix
bronte_dtm <- bronte_words %>%
  cast_dtm(author, word, n)

tm::inspect(bronte_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 15328)>>
## Non-/sparse entries: 21265/9391
## Sparsity           : 31%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##            Terms
## Docs        catherine day don’t heathcliff house jane linton miss sir time
##   Charlotte         1 232   155          0   182  341      0  310 316  244
##   Emily           336 105   180        421   142    0    346  129  43  128
#estimate the LDA model, specifying the number of topics (k)
bronte_lda <- LDA(bronte_dtm, k = 6, control = list(seed = 1234))

#extract the per-word-per-topic probabilities
bronte_topics <- tidy(bronte_lda, matrix = "beta")

head(bronte_topics, n = 6)
## # A tibble: 6 × 3
##   topic term           beta
##   <int> <chr>         <dbl>
## 1     1 heathcliff 1.86e-27
## 2     2 heathcliff 4.89e-20
## 3     3 heathcliff 6.22e- 3
## 4     4 heathcliff 1.16e- 2
## 5     5 heathcliff 5.61e-15
## 6     6 heathcliff 8.71e-30
#interesting that this first term again has its highest beta in topic 4, as "wife" did above

#plot the results
bronte_top_terms <- bronte_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

bronte_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  scale_y_reordered() +
  theme_tufte(base_family = "Helvetica")

##have a look at whether there are specific words that can distinguish the two books
tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
## Count most common words in both
tidy_bronte %>%
  count(word, sort = TRUE)
## # A tibble: 15,328 × 2
##    word           n
##    <chr>      <int>
##  1 miss         439
##  2 heathcliff   421
##  3 time         372
##  4 sir          359
##  5 linton       346
##  6 jane         341
##  7 catherine    337
##  8 day          337
##  9 don’t        335
## 10 house        324
## # ℹ 15,318 more rows
bookfreq2 <- tidy_bronte %>%
  mutate(author = ifelse(gutenberg_id==768, "Emily", "Charlotte")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(author, proportion)
head(bookfreq2)
## # A tibble: 6 × 3
##   word         Charlotte     Emily
##   <chr>            <dbl>     <dbl>
## 1 a            0.000170  0.0000473
## 2 abaht       NA         0.0000237
## 3 abandon      0.0000463 0.0000237
## 4 abandoned    0.000123  0.0000947
## 5 abandonment  0.0000309 0.0000237
## 6 abashed     NA         0.0000237
ggplot(bookfreq2, aes(x = Emily, y = Charlotte, color = abs(Emily - Charlotte))) +
  geom_abline(color = "orange", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "purple") +
  theme_tufte(base_family = "Helvetica") +
  theme(legend.position="none", 
        strip.background = element_blank(), 
        strip.text.x = element_blank()) +
  labs(x = "Emily", y = "Charlotte") +
  coord_equal()
## Warning: Removed 8982 rows containing missing values (`geom_point()`).
## Warning: Removed 8983 rows containing missing values (`geom_text()`).

##split into chapters
bronte <- bronte %>%
  filter(!is.na(text))

# Divide into documents, each representing one chapter
bronte_chapter <- bronte %>%
  mutate(author = ifelse(gutenberg_id==768, "Emily", "Charlotte")) %>%
  group_by(author) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, author, chapter)

# Split into words
bronte_chapter_word <- bronte_chapter %>%
  unnest_tokens(word, text)

# Find document-word counts
bronte_word_counts <- bronte_chapter_word %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
bronte_word_counts
## # A tibble: 73,614 × 3
##    document     word           n
##    <chr>        <chr>      <int>
##  1 Charlotte_27 jane          53
##  2 Charlotte_38 jane          51
##  3 Charlotte_38 sir           40
##  4 Emily_10     heathcliff    37
##  5 Charlotte_24 sir           35
##  6 Charlotte_34 st            35
##  7 Charlotte_18 ingram        34
##  8 Emily_21     linton        34
##  9 Emily_27     catherine     33
## 10 Charlotte_34 john          32
## # ℹ 73,604 more rows
# Cast into DTM format for LDA analysis

bronte_chapters_dtm <- bronte_word_counts %>%
  cast_dtm(document, word, n)

tm::inspect(bronte_chapters_dtm)
## <<DocumentTermMatrix (documents: 73, terms: 15241)>>
## Non-/sparse entries: 73614/1038979
## Sparsity           : 93%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##               Terms
## Docs           catherine day don’t heathcliff house jane linton miss sir time
##   Charlotte_17         0   8     9          0     5    0      0   16   7   12
##   Charlotte_21         0  17     3          0     7   21      0   17  28   10
##   Charlotte_24         0  12    17          0     2   22      0    7  35   11
##   Charlotte_27         0   9    10          0     7   53      0    1  23   16
##   Charlotte_28         0  10     4          0    14    1      0    0   1    8
##   Charlotte_34         0  13     5          0    11   20      0    1   4   10
##   Charlotte_38         0   7     6          0    12   51      0    4  40    5
##   Emily_10            19   7    10         37     6    0     27    7   3    8
##   Emily_17            13   6    10         27     7    0      9    3   1    5
##   Emily_21            22  10    12         21     8    0     34   17   0   11
# Re-estimate the topic model with this DTM object (two topics, fixed seed for reproducibility)
bronte_chapters_lda <- LDA(bronte_chapters_dtm, k = 2, control = list(seed = 1234))


# Get the gamma values: the estimated proportion of words within a given chapter allocated to a given topic
bronte_chapters_gamma <- tidy(bronte_chapters_lda, matrix = "gamma")
bronte_chapters_gamma
## # A tibble: 146 × 3
##    document     topic gamma
##    <chr>        <int> <dbl>
##  1 Charlotte_27     1 0.444
##  2 Charlotte_38     1 0.472
##  3 Emily_10         1 0.500
##  4 Charlotte_24     1 0.530
##  5 Charlotte_34     1 0.418
##  6 Charlotte_18     1 0.495
##  7 Emily_21         1 0.596
##  8 Emily_27         1 0.559
##  9 Charlotte_4      1 0.591
## 10 Charlotte_5      1 0.305
## # ℹ 136 more rows
## Each row gives gamma, the estimated share of a chapter's words generated by that topic;
## for a two-topic model, the two gamma values for each chapter sum to 1.

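Since the gamma values are proportions, they should sum to one across topics within each document. A quick sanity check, reusing the `bronte_chapters_gamma` object created above:

```r
# Gamma values for each document should sum to 1 across the two topics
bronte_chapters_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma)) %>%
  head()
```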


# First separate the document name into title and chapter

bronte_chapters_gamma <- bronte_chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)

bronte_chapter_classifications <- bronte_chapters_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup()

bronte_book_topics <- bronte_chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

bronte_chapter_classifications %>%
  inner_join(bronte_book_topics, by = "topic") %>%
  filter(title != consensus)
## # A tibble: 31 × 5
##    title     chapter topic gamma consensus
##    <chr>       <int> <int> <dbl> <chr>    
##  1 Charlotte      24     1 0.530 Emily    
##  2 Charlotte       4     1 0.591 Emily    
##  3 Charlotte      20     1 0.509 Emily    
##  4 Charlotte      21     1 0.539 Emily    
##  5 Charlotte       3     1 0.579 Emily    
##  6 Charlotte      11     1 0.558 Emily    
##  7 Charlotte      16     1 0.593 Emily    
##  8 Charlotte      13     1 0.556 Emily    
##  9 Charlotte      29     1 0.698 Emily    
## 10 Charlotte      10     1 0.503 Emily    
## # ℹ 21 more rows
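To summarize how often chapters end up assigned to the other author's consensus topic, we can compute a simple misclassification rate. This is a sketch reusing the `bronte_chapter_classifications` and `bronte_book_topics` objects from above:

```r
# Proportion of chapters whose highest-gamma topic belongs to the other author
bronte_chapter_classifications %>%
  inner_join(bronte_book_topics, by = "topic") %>%
  summarise(misclassified = mean(title != consensus))
```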
# Look at document-word pairs to see which words in each document were assigned to which topic

assignments2 <- augment(bronte_chapters_lda, data = bronte_chapters_dtm)
assignments2
## # A tibble: 73,614 × 4
##    document     term  count .topic
##    <chr>        <chr> <dbl>  <dbl>
##  1 Charlotte_27 jane     53      2
##  2 Charlotte_38 jane     51      2
##  3 Charlotte_24 jane     22      2
##  4 Charlotte_34 jane     20      2
##  5 Charlotte_4  jane     14      1
##  6 Charlotte_5  jane      3      2
##  7 Charlotte_25 jane     21      2
##  8 Charlotte_26 jane      5      2
##  9 Charlotte_20 jane     24      2
## 10 Charlotte_21 jane     21      2
## # ℹ 73,604 more rows
assignments2 <- assignments2 %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(bronte_book_topics, by = c(".topic" = "topic"))

assignments2 %>%
  count(title, consensus, wt = count) %>%
  group_by(title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) +
  geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +
  theme_tufte(base_family = "Helvetica") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Book words assigned to",
       y = "Book words came from",
       fill = "% of assignments")
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

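To see which specific words drive the off-diagonal assignments in the plot above, we can list the terms most often assigned to the other author's topic, reusing the `assignments2` object:

```r
# Words most frequently assigned to the "wrong" author's topic
assignments2 %>%
  filter(title != consensus) %>%
  count(title, consensus, term, wt = count) %>%
  arrange(desc(n)) %>%
  head(10)
```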
## Reformat the data into a quanteda corpus object

# Load the Bronte text data into a quanteda corpus
corp2 <- corpus(bronte, text_field = "text")
# Sample 100 documents to keep the preText computation manageable
documents2 <- corp2[sample(1:3000, 100)]
# take a look at the document names
print(names(documents2[1:10]))
##  [1] "text2204" "text1118" "text2610" "text751"  "text2732" "text1426"
##  [7] "text1675" "text2698" "text1091" "text1322"
# Preprocess the data in 128 different ways (all combinations of seven binary preprocessing choices)
preprocessed_documents2 <- factorial_preprocessing(
    documents2,
    use_ngrams = TRUE,
    infrequent_term_threshold = 0.2,
    verbose = FALSE)
## Preprocessing 100 documents 128 different ways...
# Compare the distances between documents that have been preprocessed in different ways
preText_results2 <- preText(
    preprocessed_documents2,
    dataset_name = "Bronte text",
    distance_method = "cosine",
    num_comparisons = 20,
    verbose = FALSE)
## Generating document distances...
## Warning in stats::cmdscale(distances, k = dimensions): only 0 of the first 2
## eigenvalues are > 0
## Generating preText Scores...
## Generating regression results..
## The R^2 for this model is: 0.4171211 
## Regression results (negative coefficients imply less risk):
##                  Variable Coefficient    SE
## 1               Intercept       0.088 0.008
## 2      Remove Punctuation      -0.020 0.006
## 3          Remove Numbers       0.001 0.006
## 4               Lowercase       0.000 0.006
## 5                Stemming       0.001 0.006
## 6        Remove Stopwords      -0.008 0.006
## 7 Remove Infrequent Terms       0.048 0.006
## 8              Use NGrams      -0.008 0.006
## Complete in: 15.25 seconds...
# Plot the preText score for each preprocessing combination
preText_score_plot(preText_results2)
## Warning in ggplot2::geom_point(ggplot2::aes(x = Variable, y = Coefficient), :
## Ignoring unknown parameters: `linewidth`
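The score plot summarizes how unusual each preprocessing combination is overall. The preText package also provides `regression_coefficient_plot()` to visualize which individual preprocessing decisions are associated with higher or lower risk (negative coefficients imply less risk):

```r
# Plot the regression coefficients from the preText results,
# dropping the intercept so the preprocessing effects are easier to read
regression_coefficient_plot(preText_results2,
                            remove_intercept = TRUE)
```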